Spontaneous genetic clustering in populations of competing organisms 
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We introduce and analyse an individual-based evolutionary model, in which a population of 
genetically diverse organisms compete with each other for limited resources. Through theoretical 
analysis and stochastic simulations, we show that the model exhibits a pattern-forming instability 
which is highly amplified by the effects of demographic noise, leading to the spontaneous formation 
of genotypic clusters. This mechanism supports the thesis that stochasticity has a central role in 
the formation and coherence of species. 



I. INTRODUCTION 

The development of a quantitative theory of specia- 
tion is of fundamental biological importance, however, 
the complex relationships between the various mecha- 
nisms at work make this enterprise fraught with difhcul- 
ties. The analysis of simple mathematical models of evo- 
lutionary dynamics can provide invaluable insight, par- 
ticularly as a tool to distinguish necessary and sufficient 
conditions for the formation of new species. In recent 
years there has been considerable interest in the possi- 
bility of sympatric speciation driven by competition for 
resources. The mathematical formulation of this prob- 
lem dates back to MacArthur and Levins although 
similar models have been proposed by many others [2|t31 ■ 
The robustness of this mechanism of speciation has been 
called into question, however, as species may or may not 
form depending on the precise details of how the effects 
of competition are modelled [sl-fioj . 

Traditionally, many mathematical models of evolution- 
ary processes are formulated at a macroscopic level, de- 
scribing the dynamics of entire populations and neglect- 
ing the effects of intrinsic demographic noise. In a recent 
study [ni we analysed an individual-based (i.e., micro- 
level) stochastic model of competition between individual 
organisms. The model reduces to the usual population- 
level equations in the limit of infinite population size 
but, crucially, for finite populations we found that spe- 
ciation is dramatically enhanced by the effects of demo- 
graphic noise. This observation serves firstly to show that 
competition-driven speciation is in fact far more robust 
an effect than is suggested by a deterministic analysis. 
Secondly, it illustrates the need to take the individual 
nature of the organisms into account when modelling spe- 
ciation. 

The models discussed above all relate to phenotypic 
speciation, where phenotypes are represented by a nu- 
merical value in a one-dimensional 'niche-space' j2^ . 
This is, of course, an over-simplification, and care should 
be taken in drawing conclusions on the basis of such mod- 
els. In this paper, we introduce and analyse a genetic 
counterpart to the individual-based model studied in [ll| , 



with the purpose of investigating speciation from a differ- 
ent, and complementary, viewpoint. Our model consists 
of a population of organisms which are characterised by 
their "genomes" , which we model as binary sequences. 
The organisms reproduce (with mutation) and die due 
to competition between individuals, where competition 
is strongest between organisms with similar genomes. 

We are interested in studying the formation of species, 
which we interpret as well-separated clusters of geneti- 
cally similar organisms. One immediate problem which 
arises is the question of how to detect such clusters from 
the genetic data. This is precisely the problem faced 
by biologists seeking to classify organisms using genetic 
sequencing and many methods exist [T^ . [l3| . For our 
analysis, we choose to study the distribution of genetic 
distance between pairs of randomly selected organisms. 
This statistical measure is sufficient to determine if ge- 
netic clusters have formed, as well as being amenable to 
theoretical analysis. Moreover, closely related measures 
have already been employed in experimental genetics, for 
example 

Starting from the individual based stochastic model, 
we perform a systematic expansion in population size, 
providing a mathematical description of the model at 
three levels. In the limit of infinite populations we recover 
a deterministic system of differential equations for the 
frequency of different genomes in the population. Anal- 
ysis of this systems reveals a pattern-forming instability 
which may be interpreted as describing the formation of 
disjoint genetic clusters, that is, the formation of species. 
For large but finite populations, a linear noise analysis 
shows that demographic stochasticity acts to enhance 
the clustering process, leading to quasi-clusters in an 
otherwise homogeneous population. Finally, a full (non- 
perturbative) analysis is possible in the neutral case of 
global competition. Depending on the scaling relation- 
ship between mutation rate and population size, we find 
that demographic noise can lead to the population spon- 
taneously forming sharply delineated genetic clusters. 

The paper is organised as follows. The model is de- 
fined in the next section, after which Section III deals 
with the mathematical formulation of the model in terms 
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of a Langevin equation defined in a high-dimensional 
space. The theoretical analysis of deterministic, weak- 
and strong-noise effects is presented in Section IV, along 
with comparisons to simulation data. In Section V we 
conclude with a discussion of our findings. There are two 
technical appendices. 



II. A GENETIC MODEL OF COMPETITION 

In this section we introduce an individual-based 
stochastic model of a competing population. The ge- 
netics of this population are modelled by using binary 
sequences of length N; in our biological analogy, a zero 
or a one at a given point in this sequence tells us which of 
two possible gene variants (alleles) is present at that locus 
[25} . An individual organism is specified by its genome or, 
more conveniently, by the list of type-one alleles it pos- 
sesses. For example, the 8-bit genome (0, 1, 1, 0, 0, 0, 1, 0) 
corresponds to the set {2, 3, 7}. At time t in the model, 
there are Af{t) living organisms, with genomes labelled 
by sets h, . . . , I^(t). 

It is also necessary to define a notion of genetic sim- 
ilarity between organisms. We choose to measure the 
distance between two genomes by counting the number 
of entries they have in common, known as the Hamming 
distance [Tsj . This is a standard approach in quantifiy- 
ing genetic distance in experimental studies, for example 
[14| . The Hamming distance between genomes labelled 
with sets / and J is equal to the number of elements ap- 
pearing in one of / or J, but not both. We will use the 
notation lQj = {n : n G lUJ and n ^ / n J } , so that 
the Hamming distance between / and J may be written 
| / G J|, where | • | denotes the cardinality of the set. 

Each organism reproduces asexually with the same 
constant rate. We choose our timescale so as to set this 
rate to one. The genome of the offspring is cloned from 
that of the parent, with the possibility of some muta- 
tion: each point in the gene sequence has a probability 
fi of hipping between and 1. In addition, we assume 
that all possible genomes are viable, so all reproduction 
events result in the addition of an organism to the pop- 
ulation. The rate with which an organism with genome 
/ gives birth to one with genome J is thus 

i?,, = ^l^eJ|(l-;.)^^-l^e.|. (1) 

Deaths in our model result from competitive interac- 
tions between the organisms. In phenotypic models of 
competition, it is assumed that individuals with simi- 
lar phenotypes are likely to exploit their environment in 
similar ways, and will thus compete more with each other 
than with organisms whose phenotypes are very different. 
We apply the same convention to our genotypic model, 
with the assumption that the map between genotype and 
phenotype is sufficiently simple that we may treat com- 
petition as a function of genetic similarity. We define the 
strength of competition between organisms with genomes 



/ and J to be a function of their Hamming distance: 

Gu^9i\lQJ\). (2) 

The function g is chosen to be decreasing (so that com- 
petition strength declines with genotypic distance), and 
normalised according to 

This particular choice of normalisation is made in order 
to simplify the expression for the overall carrying capac- 
ity of the system, as will be made clear later. 

The death rate of an organism n at time t is given by 
the total competition it experiences, multiplied by a con- 
stant K. This parameter controls the carrying capacity: 
when K is large, competition is fierce and only a few or- 
ganisms can coexist; when it is small, death rates are low 
and the population grows large. In fact this relationship 
is rather precise; it can be seen from both the simulations 
and theory that the total population is typically close to 

The birth and death rates defined above specify the dy- 
namics of the model. Starting from an initial seed popu- 
lation consisting of Af{0) = 1/k organisms with uniformly 
randomly assigned genomes, we allow the processes of re- 
production and competition to shape the population. For 
numerical simulations this is achieved using Gillespie's 
algorithm [l6| . 

III. MATHEMATICAL FORMULATION 
A. Master equation 

We now embark on a theoretical analysis of the be- 
haviour of our model of genetic competition. The first 
step is to formulate the model in the standard way as a 
Markov process described by a master equation |17| . 

At time t, we specify the state of the system by a vector 
X with entries indexed by the subsets of {1, ... , N}. The 
entry x/ gives the (scaled by k) number of organisms 
with genome /: 

^^(t) 

XI ^ 5l^J . 

n=l 

Our analysis concerns the time evolution of the distribu- 
tion P{x, t), giving the probability of finding the system 
in state x at time t. To determine the rate of change of 
P in time, we must consider contributions coming from 
the two processes which alter the system state - birth 
and death. 

The birth of an organism with genome / alters the state 
of the system through the addition of n to xj. The rate 
Bj with which this event occurs is found by summing 
the birth rate of all existing organisms (which we have 
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set equal to unity) multiplied by the probability of the 
offspring being suitably mutated to have genome /. That 
is, 

AA(t) ATit) ^ 

Bi = y] Rii„ = rijSi,,,j = - v ruxj . 

n—l J n—1 J 

The death rate of an organism with genome / is given by 
the sum of the competition between itself and the other 
organisms, multiplied by k. Multiplying this quantity by 
the number of organisms with that genome (i.e. xj / n) 
gives a total death rate of 

A/-(t) ^ 

Di = Gn„ = - ^ xiGijxj . 



Combining the effects of these two processes, we may 
write the master equation as [13] 



I, J 



RijxjP 



(4) 



GijxjxjP 



where £f is a step operator which alters its argument 
through the addition of ±k to xj. 



B. Kramers-Moyal expansion 

We are interested in the limit of small k, in which the 
effect of competition is weak and hence the population 
grows large. In this regime we approximate P by a con- 
tinuous probability distribution V, and expand the step 
functions in their Taylor series: 



dx\ 



(5) 



Applying this expansion to the master equation Q and 
truncating at i = 2 yields the non-linear Fokker-Planck 
equation [18|] 



I.J 

+ {Rijxj + Gijxi3 



(6) 



For our purposes, it will be more convenient to work with 
the equivalent Langevin equation (using the Ito formal- 



ism) [3: 

TT =2^ [Rijxj - xiGijxj ) 



dt 



{Rijxj + xiGijxj) 



1/2 



(7) 



mit) 



where the r?/(t) are independent Gaussian white noise 
variables with zero mean and unit variance, that is, 
(77/(i)77j(<')) = 5{t — t')Sij. Here, and hereafter, we use 
(• • • ) to denote averaging over the noise. 

It is worth pausing for a moment at this stage to dis- 
cuss the precise sense in which equation ([7]) describes the 
behaviour of our original microscopic stochastic model. 
The astute reader may be concerned by the fact that we 
treat the xj as continuous stochastic variables, when in 
reality the large number of possible genomes means that 
most xi will be exactly zero, with only a few taking val- 
ues K, 2k, etc. The explanation is that, although P and 
V take different arguments (one discrete, the other con- 
tinuous), their first and second order moments agree up 
to 0{k^). The error committed in this approximation 
was rigorously bounded by Kurtz [l^; as we will see, it 
is quite sufficient for our purposes. 



C. An orthogonal basis 

The simulations presented in later sections are taken 
from a model with an A'^ = 32 bit genome. Even with 
this relatively low number of loci, the system ([T]) has 
some 4,294,967,296 dimensions. Care needs to be taken 
to arrive at analytical results which are computationally 
tractable. The first simplifying step we take is to change 
basis with the aim diagonalising the mutation and com- 
petition matrices R and G, defined in ([T]) and 

We will be making use of the discrete Fourier transfor- 
mation on the space of binary sequences. To do this, we 
introduce the matrix II/j = (—1)1^^-^! and the transfor- 
mation 



(8) 



Using Eq. (IA2[) of Appendix A, the inverse transforma- 
tion is 



(9) 



The most useful property of the matrix 11 is that it di- 
agonalises Hamming-distance invariant functions. Gen- 
erally, if F is a matrix with entries Fjj = f{\I Q J\) 
for some function /, then HFH is diagonal. Proof of this 
fact is given in Appendix A. Both R and G matrices have 
this property and so are completely characterised by the 
quantities 
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Hi?H 



HGH 



(10) 
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for / C {!,... ,N}. It is shown in Appendix A that 
Pj ~ {1 — 2/i)l^l, which impUes that P0 — 1 for all /i. It 
also follows that 

J 

and so 70 =2~^ g(|J|), since the entries of 110^ j are 
all equal to unity. Fixing 70 specifies the normalisation 
of the competition kernel. A convenient choice is 70 = 1 
which, since there are (^) genomes with \J\ = n, gives 
the normalisation specified in Eq. (|3]). 

The useful properties of this transform motivate a 
change of variables from x to y = x. Carrying this out 
in Eq. ([7|) we arrive at the Langevin equation 

^ = yi PI -^yjyieJiiej + VKdit) ■ (n) 

Here the Ciit) are Gaussian noise variables with correla- 
tions 

(C/WO(i')) 

= S{t- t') '^Uik{Rklxl + xiGklxl)^kj 

K,L 

= S{t-t')i yieJPieJ + ^yK yieJeKlieJeK j ■ 

(12) 

Equations (jlip and ([T2l) will form the starting point for 
our analysis of the behaviour of the system. 

IV. ANALYSIS 

A. Deterministic dynamics 

We first consider the behaviour of the model in the 
limit of very large population sizes, with mutation 
strength held constant. This corresponds to taking k, — ?> 
0, in which case the Langevin equation (ITT]) reduces to 
the deterministic system 

^ ^ yi PI "^yjyieJiiej ■ (13) 

J 

Now, since ^0 = 70 = 1, we find that the deterministic 
equation has a fixed point at yi = 80 j. In the origi- 
nal variables, this corresponds to xj — 2^^ for all /; that 
is, the organisms are spread homogeneously throughout 
the genetic space. We denote by A the Jacobian matrix 
of at this fixed point, whose entries are 

Aij ^Sij{pi -ji -1) . (14) 

The homogeneous fixed point is therefore stable if and 
only if p/ — 77 < 1 for each /. The boundary of the 
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FIG. 1: Phase diagram showing the stability of the homo- 
geneous state of the 32-bit genome model in the determinis- 
tic limit K — >■ 0. The parameters w and /i on the horizontal 
and vertical axes respectively control the width of the top-hat 
competition kernel and the strength of mutation. Numbered 
dots show the parameter values used for simulations appear- 
ing in later figures (with corresponding figure numbers). 

stability of the homogeneous state is determined by the 
balance between the strength of mutation and the shape 
of competition kernel. To illustrate this, we consider a 
particular choice of kernel with a 'top-hat' shape param- 
eterised by the width we [0, TV]. Let 

J l/.gu, if n<w 
g(n) = < 

I otherwise, 

where is the normalisation constant enforcing (jS]). 

Figure[l]shows the phase diagram in this case with axes 
for mutation strength p, and kernel width w. The unusual 
sawtooth shape of the boundary may be attributed to the 
geometry of sequence space, since the overlap between 
top-hat competition kernels (i.e. spheres) depends on 
their parity. The packing of spheres in sequence space 
is itself a difficult problem in information theory, with 
roots going back to the seminal work of Hamming on 
error-correcting codes [isl ] . 

What behaviour will the model exhibit in the unstable 
regime? If the system is unstable in direction yj then 
the population density variables xj will each either be 
exponentially enhanced or suppressed, according to the 
sign of H/j. This is a pattern- forming instability in di- 
rect analogue with those occurring in spatial systems Q 
and on networks (20j . Once a pattern has formed, some 
clusters of genomes will be very common amongst the 
population while others are totally absent. This process 
can be thought of as describing the formation of species: 
the population has split into several groups which are 
genetically isolated from each other. 

We should point out that not all choices of kernel will 
result in a pattern-forming transition in the deterministic 
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dynamics. From the earlier stability analysis, we see that 
if 7/ > for all /, then the homogeneous state is always 
stable and species cannot form. It has been noted that 
this condition is quite restrictive as it appears to rule 
out simple explanations such as the overlap of resource 
consumption being responsible for species formation [2lj . 
However, we will see that the effects of demographic noise 
are powerful enough to override this analysis. 



B. Weak noise effects 

In our previous work on phenotypic competition, we 
found that demographic noise strongly affected the for- 
mation of species HlJ . It is natural to ask if the same is 
true in the present genome-based model. 

As a first approximation, we look for small stochastic 
corrections to the deterministic system. Suppose we are 
in the situation that the homogeneous state yi = 60 j is 
stable in the deterministic dynamics ((T5)) . We linearise 
the Langevin equation (jlip around this state, introducing 
the change of variables zj = (yj — 61^0) /^/k. Keeping 
only the lowest order terms in k we arrive at the linear 
stochastic differential equation 



dz 
It 



(15) 



where A is the Jacobian matrix defined in ([T4|) and ^{t) 
is a vector of independent white noise variables with unit 
variance. This is an Ornstein-Uhlenbeck process, whose 
general solution is known [17j . 

For our purposes, we are mainly interested in the be- 
haviour of the correlations between variables. Let us de- 
fine the shorthands Yjj = (yiyj) and Zjj = (zizj). A 
standard result [l3| states that if z satisfies (fT5|) . then 
for Z we have 

— =AZ + ZA'^ + 21, 
at 

where I denotes the identity matrix of size 2^. Since in 
our case the matrix A is diagonal, the dynamics of the 
Zjj are independent of one another and can be solved 
easily. In particular, in the long time limit we find 



Zij ^ Si^, 



1 



1 + 7/ - p/ 
and thus, changing back to y variables, 

Yij ^ Sij (Sj, 



l + li - pi 



(16) 



The (5/ part in the above equation comes the deter- 
ministic part of the y variables, and the second term the 
stochastic corrections, which are of order ^/k. 

It is not immediately obvious from (jl6p what qualita- 
tive difference to the genetics of the population will result 
from this stochastic term. To help answer this question. 



we investigate the distribution of genetic (Hamming) dis- 
tance between randomly selected organisms. For each 
n € {0,...,A^}, define 



S(n) 



k,l 



eh 



\Ie.J\,nXlXj . 



I, J 



(17) 



If two organisms are selected at random from the popula- 
tion, S(n) gives the probability that their genomes differ 
in n loci. It is straightforward to compute that at the 
deterministic fixed point xi = the shape of 2 is a 
symmetric binomial distribution: S(n) = 2~''^(^). 

The calculation of the covariance of Fourier variables 
y gives sufficient information to compute the long-time 
average form of S(n) in the presence of noise. From 
Eq. dm): 



I, J 



(18) 



1 fN 



I, J 



where (• • • )oo refers to averaging over the stationary dis- 
tribution. The second equality comes from the symmetry 
between genomes, meaning that we may choose to study 
the pair 1 = and J = {1, . . . , n}, which is representa- 
tive of all 2^ (^) pairs of Hamming distance n. 

In the regime of weak noise, the long-time behaviour 
of the correlation function is given by Eq. ()16p . Using 
this result in Eq. (fT8|) gives 



+ 7/ - P/ 



(19) 



which clearly shows the deterministic result plus the or- 
der K stochastic correction. 

A typical example of weak noise affecting the distri- 
bution of Hamming distances is shown in Fig. [21 The 
theoretical prediction from Eq. (|19p is compared with 
data gathered from simulations, averaged over 100 sam- 
ples. We have chosen a top-hat competition kernel, the 
phase diagram for which is given in Fig. [TJ The param- 
eters w = 30, /i = 2^^^, K = 10~^ are well within the 
region of stability for the homogeneous state, meaning 
that the deterministic theory predicts that the distribu- 
tion of pairwise Hamming distance should be binomial. 
As is visible in Fig. [21 there is a significant noise-induced 
deviation: the distribution is skewed to the left, that 
is, randomly selected organisms often have more genetic 
data in common than one would expect. Demographic 
noise is causing the formation of genotypic clusters. 



C. Strong noise effects 

Moving beyond weak noise effects, we can make further 
analytic progress by considering the paradigmatic 'neu- 
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FIG. 2: Distribution of pairwise Hamming distance as mea- 
sured from simulations in the weak noise regime (black circles) 
and predicted by the theory (blue/grey area). The binomial 
distribution predicted by the deterministic theory is shown 
for comparison (line). Parameters here are w — 30, fi — 2~^^ , 
K — 10""', and the simulation result was the averaged over 
100 samples at taken time t — 1000. 



tral' case in which competition strength is independent of 
genetic distance and thus all organisms have equal fitness. 
This is a special case of the top-hat kernel we considered 
earlier, with w — N and thus the homogeneous state is 
stable for all values of the mutation coefficient /i. 

Changing basis, g{n) = 1 gives 77 — 61^0, and thus the 
Langevin equation for the y variables simplifies to 

^ = yi{pi~y^) + V^Ci{t), (20) 



where 

(C/(i) 0{t')) = - t') Viej{pieJ + V0) 



(21) 



Notice that the dynamics of 2/0 are separated from those 
of the other variables: we have 



where 



^ = 2/0(1 - 2/0) + VkC0(O : 



(C0W') = y0(i + y0)- 



This equation describes noisy logistic growth, and the 
long-time quasi-stationary distribution was computed in 
Unsurprisingly, as k — 0, the distribution of 2/0 
approaches a delta function centred on one. 

We can exploit this fact mathematically through the 
use of adiabatic elimination, setting 2/0 = 1 and thus 
C,0{t) = 0. In Appendix B we derive general expressions 
for conditioned stochastic differential equations, which 
can be applied here to give for /, J 7^ 

dyi 
dt 

where now 

{mcj{t')) = 

^0 {^iej{piej + ij - yiyj ^ 

(23) 



yi {pi - 1) + 



(22) 



{pi + 1) {PJ 




Comparing this expression to (|21l) we see that condition- 
ing on the value of C0 results in an anti-correlation be- 
tween the other noise variables which previously was not 
present. This will act to enhance the formation of certain 
patterns of clusters. 

We are now in a position to evaluate the dynamics of 
the moments of the remaining degrees of freedom. From 
Eq. ([22]) , we have 



djyi) 

dt 



{Pi-i){yi), 



for / 7^ 0. Since pi < 1 for all /, each (j//) undergoes 
exponential decay. Earlier we specified that the genomes 
of the initial 'seed' population are randomly assigned, we 
thus deduce that the relation 



(24) 



holds throughout. Moving on to examine the covariance 
structure, we employ Ito's lemma [Hj to obtain the fol- 
lowing equation for Yjj = (yiyj) with I, J ^ 0: 



^ = L + p.j-2-^{pj + l){pj + l)]Yu 



+ K{piej + l)(y/e,/) 



(25) 



Substituting for (yiej) using Eq. (|24|). we find that the 
only non-zero contributions to Yf j arise when IQ J = 0, 
that is, when I — J. Solving Eq. (|25|) . we find that in 
the long-time limit 



Yij ^ S_ 



Pi 



PI -I 



Recalling that pi — [1 — 2/i)l-^l, we observe that the scale 
of Yji is determined by the relationship between compe- 
tition strength k and mutation rate /i. In the two limiting 
cases; we have Yu = when k = , 7^ and Yu = 1 
when K 7^ , /i = 0. 

We can explore the range between these extremes by 
taking the limit k — and /i — with r = k/2/x fixed. 
Biologically, this corresponds to the joint scaling in which 
populations are very large and mutations very rare, but 
the total number of mutations per generation occurring 
in the whole population remains approximately constant. 
In this case the above equation simplifies to 



Yij ^ 61J- 



(26) 



To make a prediction about the presence or absence of 
species formation, we compute the distribution of Ham- 
ming distance for this case. Inserting the result (j26|) into 
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isms. Species have formed. 

Whilst the agreement between simulations and theory 
for the average distribution of pairwise Hamming dis- 
tance is excellent, we should point out that the measured 
distributions vary greatly from one simulation run to the 
next. We demonstrate this in the figure by plotting the 
results of the first simulation in both samples. The pres- 
ence of multiple peaks implies the formation of several 
disjoint species. 
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n 

FIG. 3: Distribution of pairwise Hamming distance as mea- 
sured from simulations in tlie strong noise regime (black cir- 
cles) and predicted by the theory (blue/grey area). The pa- 
rameter values are k = 10""^ for both plots, r = 1 in the 
upper plot and r = 4 in the lower. The simulation result was 
the averaged over 1000 samples at taken time t = 10000. The 
unaveraged data is highly random; in both plots the dark line 
shows the distribution of pairwise Hamming distance mea- 
sured from the first simulation in the sample. 

equation (jl8p we obtain 

^±y'y ^N^ (nWN- n^ r 
2N \n \m \ k r ' r + k + m 

m=0 fe=0 ^ / ^ / ^ / 

^ r(iV + 1) r(r + 1) 2Fi{n - N,T;n + T + l: -1) 
2^r(7V-n + l)r(n + T + l) 

(27) 

where 2^1 denotes the hypergeometric function. The last 
line is established by using the integral representation 
= Jp e~"^ dz for a = t + k + m, after which the 
two sums become simple binomial expansions. 

Depending on the value of r, equation ([?7)) predicts 
that the distribution of pairwise Hamming distance will 
interpolate between a symmetric binomial and a delta 
function at zero. This is illustrated in Fig. [3l which 
shows the distributions resulting from the values r = 1 
and T — 4. In both cases the deterministic theory pre- 
dicts a binomial distribution, as the values fj, are well 
within the stable region (see set of points '3' in Fig. [1]). 
As T increases, the left skew of the distribution becomes 
stronger, meaning that the population has grouped to- 
gether into tight clusters of genotypically similar organ- 



V. CONCLUSION 

To summarise, we have investigated a simple 
individual-based genetic evolutionary model, which is 
driven by the effects of mutation and competition. The- 
oretical analysis in the limit of large population size 
revealed several interesting phenomena. On a macro- 
scopic (deterministic) level, the model exhibits a pattern- 
forming transition whereby the decline of competition 
strength with genetic distance can drive the formation 
of genotypic clusters in an initially diverse population. 
On further investigation it was found that this pattern- 
forming process is highly amplified by the effects of de- 
mographic noise in the model. Large but finite popula- 
tions exhibit quasi-clustering when the mutation strength 
is relatively large while, more strikingly, lower mutation 
strengths lead to the formation of clearly distinct species 
which are not predicted by the deterministic analysis. We 
have demonstrated that the propensity to form species is 
determined by the average total number of mutations per 
generation in the population. 

The phenomenon of spontaneous speciation was first 
observed in the phenotypic version of the model In 
fact, whilst the combinatorial aspects are more involved, 
the essential flavour of the calculation presented here is 
the same. What we have achieved by introducing a ge- 
netic formulation is a step towards greater biological rel- 
evance, as well as providing further evidence that this 
mechanism of speciation is both general and robust. In 
forthcoming work (23| . we will examine the implications 
of this thesis in the wider context of population genetics. 

The biological relevance of the work could be further 
improved by consideration of a number of features which 
have been omitted from the model. These include: epis- 
tasis, and more generally the complex relationship be- 
tween genotype and phenotype; sexual reproduction and 
the emergence of reproductive isolation of species; hetero- 
geneity in the fitness landscape; geographic distribution 
of the population leading to allopatric/parapatric speci- 
ation. Inclusion of any of these features would provide a 
useful generalisation of the model. It is worth pointing 
out, however, that such considerations will not overturn 
our basic finding that demographic noise is itself a fun- 
damental force in the process of speciation. 
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A notable exception is [2^, who consider a complex net- 
work of possible organism types. 

Alternatively, our binary sequences can be thought of as 
a recuded model of DNA, with two nucleotides instead 
of four. 



Appendix A: Properties of H 

In the main text we claimed that the matrix 11 with 
entries II/j = (— l)!^!^-^! diagonalises any matrix whose 



entries are functions of Hamming distance. Before prov- 
ing this, we demonstrate some other useful properties of 
n. Firstly, for any /, J and K we have the identity 



H/ifllx,/ — H-KjQj ■ 



To see this, one must simply observe that 



(Al) 



\K n i\ + \K n J\ = 2\K n {I n J)\ + \K n {I e J)\ 



Secondly, 



(A2) 



K 



which implies that 11 is a multiple of its own inverse: 
= n. This follows from (|Jl|), and the fact that 



K 

This last relation can be seen to be true by noting that 
the rows of 11 are sequences of plus ones and minus ones, 
with equal numbers of each — except for the first row, 
which is all ones. 

Now, suppose is a matrix whose entries are deter- 
mined by Hamming distance according to Fij — /(|/0 
J|) for some function /. We compute 



1 

2^ 



Hi^H 



„ = ^En/i^/(l^©i|)nLj 

^ K.L 



J 



K.L.M 



^ niK^KM^LM^L,jf{\M\) 
K.L.AI 



Here the second line follows from application of the in- 
verse transform defined in ^ ; the third by the property 
(jAip : and the fourth from the sums over K and L col- 
lapsing to 2^(5/ _M and 2^(5j,m, respectively, according to 
[2|. 

As an example, the transform of the mutation matrix 
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Rij defined in Eq. ([T]) is 

ni?n 



1 

2lv 



1:1: ' 

fc=0 £=0 ^ 



J 

iV-|/| 



k=0 



(xi, . . . ,a;„) for the remaining degrees of freedom, and 
aim to derive an SDE for their behaviour under the con- 
straint. 

First, applying the Gram-Schmidt process to G{x) 
we can always write G{x) = LQ, where L is lower- 
triangular, Q is orthogonal, and both depend on x (al- 
though we have suppressed this in the notation). We 
separate L into the parts relevant to xo and a;* by writ- 
ing it in block form 



(1 - 2^l) 



\i\ 



(A3) 



L = 



The second line was obtained from the first by decompos- 
ing the sum over the sets J into the process of choosing 
k elements from / and £ from the compliment, to form a 



Loo 

where Loo is 1 x 1, L*o is n x 1 and L** is n x n. Note 
that B — GG^ — LL^, so writing B in block form also 
we obtain 



set of size k + £. 


( Boo 


Bo* ' 






V B^o 


B^-jf J 


-{ 



-'^00 



LooL^ 



LooL*o ^*o^ro + 



(B3) 



Appendix B: Conditioned Stochastic Differential 
Equations 

In our calculation for the strong-noise regime, we re- 
duced the number of stochastic degrees of freedom in 
the system by enforcing the condition y0 = 1. In this 
Appendix we show how conditioning a stochastic differ- 
ential equation (SDE) in this way alters the covariance 
structure of the noise experienced by the other variables. 
Applied to our system, the general derivation given here 
leads to Eq. ([25]) in the main text. 

Consider a vector of variables x = {xq,xi, ... ,a;„), 
satisfying the SDE 



Next we note that for any such state-dependent orthog- 
onal transformation, Q, the stochastic proce ss Q rjit) has 
the same statistics as r}{t) (see, for example, jl3|) and we 
may thus safely assume Q is the identity. 

Finally, imposing xo ~ c, we obtain 



Foix) 



^00 



(B4) 



For the remaining degrees of freedom, we arrive at 

^^0(3;) 



dx^ 



= F,{x) 



L 



*0" 



L 



00 



^=F{x)+G{x)rj{t)., 



(Bl) 



where r](t) is a vector of independent Gaussian white 
noise variables, and F and G are vector- and matrix- 
valued functions of the state x, respectively. Alterna- 
tively, we could have written the equivalent formulation 



or equivalently 



-Fo{x) — 



L^^ri^{t) , (B5) 



+ C(0, (B6) 



dx 



= F{x)+C{t), 



(B2) 



where the correlation matrix for the noise variables C* 
is simply L*h,LJ^, or, in terms of the original correlation 
matrix B: 



where ^(t) is a vector of correlated Gaussian white noise 
variables, with covariance matrix B = GG^ . That is, 

m)Q{n)^dit-t')B,^. 

Suppose we wish to impose upon the system the con- 
dition Xo = c, for some constant c. We write a;^ = 



L**L ^ 



B^oBo* 



B, 



00 



(B7) 



To obtain equations ([22]) and ([SB]) in the main text, we 
apply the condition = 1 to the system (PO)) . with the 
B matrix specified by Eq. (|2T]) . Note that in this case 
the right-hand side of Eq. (|B4p is zero. 



